Improve HTTP response handling and retry logic #520

Open
MichaelGHSeg wants to merge 18 commits into master from response-status-updates

Conversation

Contributor

@MichaelGHSeg MichaelGHSeg commented Feb 12, 2026

Summary

Implements improved HTTP response handling and retry logic to align with changes in analytics-java and analytics-next:

  • ✅ Add Authorization header (Basic auth with write key, OAuth Bearer token)
  • ✅ Add X-Retry-Count header on all requests (starts at 0)
  • ✅ Implement Retry-After header support (capped at 300s, doesn't count against retry budget)
  • ✅ Granular status code classification for retryable vs non-retryable errors
  • ✅ Replace backoff decorator with custom retry loop for better control
  • ✅ Exponential backoff with jitter (0.5s base, 60s cap)
  • ✅ Increase max retries from 10 to 1000 (to accommodate faster retry cadence)
  • ✅ Clear OAuth token on 511 Network Authentication Required
  • ✅ 413 Payload Too Large is non-retryable
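The backoff parameters above (0.5s base, 60s cap, with jitter) can be sketched as follows. The function name and the full-jitter strategy are illustrative assumptions, not necessarily the exact implementation in consumer.py:

```python
import random

BASE_DELAY = 0.5   # seconds (base)
MAX_DELAY = 60.0   # seconds (cap)

def calculate_backoff_delay(attempt):
    # Exponential ceiling, capped at MAX_DELAY, then full jitter:
    # delay is uniform in [0, min(cap, base * 2**attempt)].
    ceiling = min(MAX_DELAY, BASE_DELAY * (2 ** attempt))
    return random.uniform(0.0, ceiling)
```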

Status Code Handling

Retryable 4xx: 408, 410, 429, 460
Non-retryable 4xx: 400, 401, 403, 404, 413, 422, and all other 4xx
Retryable 5xx: All except 501, 505
Non-retryable 5xx: 501, 505
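As a sketch, the classification above maps to a small predicate (set and function names here are illustrative, not necessarily the identifiers used in consumer.py):

```python
RETRYABLE_4XX = {408, 410, 429, 460}
NON_RETRYABLE_5XX = {501, 505}

def is_retryable_status(status):
    # 5xx are retryable except 501/505; 4xx are non-retryable
    # except the explicit allow-list; other codes are not retried.
    if 500 <= status < 600:
        return status not in NON_RETRYABLE_5XX
    if 400 <= status < 500:
        return status in RETRYABLE_4XX
    return False
```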

Retry-After Behavior

  • Respected for status codes: 429
  • Retry-After attempts do NOT count against backoff retry budget
  • Capped at 300 seconds maximum delay
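A minimal sketch of the parsing behavior described above, assuming integer-seconds values only (the actual parse_retry_after() may differ in detail):

```python
MAX_RETRY_AFTER = 300.0  # seconds, hard cap

def parse_retry_after(header_value):
    # Returns the capped delay in seconds, or None when the header
    # is absent or unparseable (HTTP-date is not handled here).
    if header_value is None:
        return None
    try:
        delay = float(header_value)
    except ValueError:
        return None
    return min(max(delay, 0.0), MAX_RETRY_AFTER)
```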

Test Coverage

  • Added 30 new comprehensive tests
  • Total test suite: 106 tests passing
  • Tests cover: Authorization headers, X-Retry-Count, Retry-After parsing, status code classification, backoff logic, network errors, retry budget separation

Changes

Modified Files

  • segment/analytics/consumer.py: Custom retry loop, status code classification, Retry-After support
  • segment/analytics/request.py: Authorization header, X-Retry-Count header, parse_retry_after()
  • segment/analytics/client.py: Increased default max_retries to 1000
  • segment/analytics/test/test_consumer.py: 13 new test cases
  • segment/analytics/test/test_request.py: 17 new test cases

Alignment

This implementation aligns with:

  • analytics-java commit c3f60c5 and 80cc33d
  • analytics-next commit 31e275d2

🤖 Generated with Claude Code

MichaelGHSeg and others added 2 commits February 11, 2026 19:20
- Add Authorization header with Basic auth (base64 encoded write key)
- Add X-Retry-Count header on all requests (starts at 0)
- Implement Retry-After header support (capped at 300s)
- Retry-After attempts don't count against backoff retry budget
- Add granular status code classification:
  - Retryable 4xx: 408, 410, 429, 460
  - Non-retryable 4xx: 400, 401, 403, 404, 413, 422
  - Retryable 5xx: all except 501, 505
  - Non-retryable 5xx: 501, 505
- Replace backoff decorator with custom retry loop
- Exponential backoff with jitter (0.5s base, 60s cap)
- Clear OAuth token on 511 Network Authentication Required
- 413 Payload Too Large is non-retryable
- Add 30 new comprehensive tests (106 total tests)

Aligns with analytics-java and analytics-next retry behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Aligns with analytics-java change to accommodate shorter backoff
periods (0.5s base, 60s cap). With faster retries, a higher retry
limit allows for better resilience during extended outages.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

This PR improves HTTP response handling and retry logic in the analytics-python client, aligning with changes from analytics-java and analytics-next. The implementation replaces the backoff decorator with a custom retry loop that provides fine-grained control over retry behavior, distinguishing between Retry-After attempts and exponential backoff attempts.

Changes:

  • Replaced backoff decorator with custom retry loop implementing exponential backoff with jitter (0.5s base, 60s cap)
  • Added Authorization header support (Basic auth with write key and OAuth Bearer token)
  • Added X-Retry-Count header to track attempt numbers
  • Implemented Retry-After header support (capped at 300s, doesn't count against retry budget)
  • Granular status code classification distinguishing retryable from non-retryable errors
  • Increased default max_retries from 10 to 1000 to accommodate faster retry cadence
  • OAuth token clearing expanded to include 511 status code

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

Summary per file:

| File | Description |
| --- | --- |
| segment/analytics/request.py | Added Authorization header (Basic/Bearer), X-Retry-Count header, parse_retry_after() function, and response object to APIError |
| segment/analytics/consumer.py | Replaced backoff decorator with custom retry loop implementing exponential backoff, Retry-After support, and granular status code classification |
| segment/analytics/client.py | Increased default max_retries from 10 to 1000 |
| segment/analytics/test/test_request.py | Added 17 comprehensive tests for authorization headers, X-Retry-Count, Retry-After parsing, and OAuth token clearing |
| segment/analytics/test/test_consumer.py | Added 13 comprehensive tests for retry logic, status code classification, backoff behavior, and Retry-After support |


MichaelGHSeg and others added 6 commits February 12, 2026 18:07
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Extract duplicate exponential backoff calculation into helper function
- Add upper bound (max_total_attempts) to prevent infinite retry loops with Retry-After
- Improves code maintainability and prevents edge case of continuous Retry-After responses

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove 408/503 from Retry-After eligibility (only 429 uses Retry-After)
- Add rate-limit state to Consumer (rate_limited_until, rate_limit_start_time)
- 429 with Retry-After: set rate-limit state, raise to caller for requeue
- 429 without Retry-After: counted backoff (not pipeline blocking)
- Add maxTotalBackoffDuration / maxRateLimitDuration config (default 43200s)
- upload() checks rate-limit state before request(), enforces duration limit
- 511 OAuth gating: only retry when OauthManager is configured
- Add tests: T04, T17, T19, T20; update 429/408/503 behavior tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Handle Retry-After: 0 correctly by checking 'is not None' instead of
truthiness. Prevent silent batch re-queue on 429 without Retry-After
by gating the upload() re-queue path on rate_limited_until being set.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

segment/analytics/consumer.py:158

  • When a 429 with Retry-After is received (lines 136-144), the batch items are re-queued using queue.put(), but then in the finally block (lines 156-158), all items in the batch are marked as task_done(). This means the items that were re-queued are still marked as done, which could cause queue.join() to return prematurely before those re-queued items are actually processed. Consider either not calling task_done() for re-queued items, or handling the re-queueing differently.
            if e.status == 429 and self.rate_limited_until is not None:
                # 429: rate-limit state already set by request(). Re-queue batch.
                self.log.debug('429 received. Re-queuing batch and halting upload iteration.')
                for item in batch:
                    try:
                        self.queue.put(item, block=False)
                    except Exception:
                        pass  # Queue full, item lost
                success = False
            else:
                self.log.error('error uploading: %s', e)
                success = False
                if self.on_error:
                    self.on_error(e, batch)
        except Exception as e:
            self.log.error('error uploading: %s', e)
            success = False
            if self.on_error:
                self.on_error(e, batch)
        finally:
            # mark items as acknowledged from queue
            for _ in batch:
                self.queue.task_done()


MichaelGHSeg and others added 10 commits February 25, 2026 16:39
- Add KeyError to except clause when parsing JSON response to handle
  missing 'code' or 'message' keys
- Add explanatory comment on pre-existing except-pass pattern

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
requests.Response.__bool__() returns False for non-2xx status codes.
The checks `if e.response` and `if response` evaluated to False for
429 responses, so parse_retry_after() was never called and the SDK
fell back to normal backoff instead of respecting Retry-After.

Changed both checks to explicit `is not None` comparisons.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
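The pitfall this commit fixes can be reproduced with a minimal stand-in that mimics the documented requests.Response.__bool__ behavior (FakeResponse is purely illustrative; the real class delegates bool() to its .ok property):

```python
class FakeResponse:
    # Mimics requests.Response: bool(response) delegates to .ok,
    # which is False for any 4xx/5xx status code.
    def __init__(self, status_code):
        self.status_code = status_code

    def __bool__(self):
        return 200 <= self.status_code < 400

response = FakeResponse(429)
assert not response          # `if response:` silently skips the 429 branch
assert response is not None  # `if response is not None:` behaves as intended
```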
Add on_error handler to capture delivery failures from the SDK.
Reports success=false with the first error message when any batch
fails (non-retryable error or retries exhausted).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Align retry configuration with cross-library defaults.
Base backoff (500ms) and max backoff (60s) already matched.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
First request (retry_count=0) no longer includes the header. Retries
with retry_count > 0 continue to send X-Retry-Count: 1, 2, 3, etc.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
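A sketch of this header policy (the helper name and shape are hypothetical, not the actual request.py code):

```python
def build_retry_headers(retry_count, base_headers=None):
    # Include X-Retry-Count only on retries (retry_count > 0), so a
    # server can use the header's mere presence to detect a retry.
    headers = dict(base_headers or {})
    if retry_count > 0:
        headers['X-Retry-Count'] = str(retry_count)
    return headers
```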
Aligns with the analytics-next Node reference implementation which treats
all 200-399 responses as successful delivery. Previously only exact 200
was treated as success, causing 201/204/3xx to be misreported as errors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Resolve python via activated venv/devbox, fall back to python3
- Use $PYTHON -m pip instead of bare pip (fixes macOS/nix where pip is not on PATH)
- Add devbox setup as recommended path in README, with venv fallback instructions
@MichaelGHSeg
Contributor Author

Deep Code Review — response-status-updates

Three parallel reviewer passes covering correctness/bugs, test coverage, and edge cases.


Critical / Must-Fix

1. backoff dependency not removed from setup.py and requirements.txt

import backoff was removed from consumer.py, but setup.py still declares "backoff~=2.1" as an install dependency, and requirements.txt still lists backoff==2.2.1. This keeps shipping an unused dependency to every user.

# setup.py — remove this line:
"backoff~=2.1",

2. None passed as max_total_backoff_duration / max_rate_limit_duration causes a TypeError

client.py accepts None for both parameters (the guard is if value is not None and value < 0), but consumer.py does an unguarded arithmetic comparison:

# consumer.py line 275 — blows up if max_total_backoff_duration is None:
if time.time() - first_failure_time > self.max_total_backoff_duration:

# consumer.py line 105 — same issue with max_rate_limit_duration:
if now - self.rate_limit_start_time > self.max_rate_limit_duration:

Either reject None explicitly in client.py (simplest fix), or guard the comparisons with if self.max_total_backoff_duration is not None and ....
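The guarded form could look like this (the helper name is hypothetical; None is taken to mean "no duration limit"):

```python
import time

def backoff_budget_exceeded(first_failure_time, max_total_backoff_duration):
    # Treat None as "no wall-clock limit" instead of letting the
    # `float - None` comparison raise TypeError.
    if max_total_backoff_duration is None:
        return False
    return time.time() - first_failure_time > max_total_backoff_duration
```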


Important / Should-Fix

3. max_total_backoff_duration=0 allows one extra attempt

The duration check uses a strict > comparison:

if first_failure_time is None:
    first_failure_time = time.time()       # set AFTER first failure
if time.time() - first_failure_time > self.max_total_backoff_duration:
    raise

When max_total_backoff_duration=0, the very first time through first_failure_time is set and then immediately checked, but virtually no time has elapsed, so the comparison evaluates false (~0 > 0). The retry sleeps for 0s and a second attempt is made before the duration check fires. The semantics of max_total_backoff_duration=0 should be "give up immediately after the first failure" — change > to >= or document the off-by-one.

4. Tight re-queue loop when Retry-After: 0 is sent

parse_retry_after returns 0 for a Retry-After: 0 header (due to max(delay, 0)). set_rate_limit_state then sets rate_limited_until = time.time() + 0, which is immediately in the past. upload() sees a non-None rate_limited_until, skips the sleep (because wait_time <= 0), calls request(), gets another 429+Retry-After: 0, re-queues, and loops — tight-spinning with no delay. The fix is to treat Retry-After: 0 as no-Retry-After (return None from parse_retry_after when delay == 0), or unconditionally floor the wait to at least some small epsilon before the sleep guard.
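A sketch of the first suggested fix, returning None when the parsed delay is zero so that Retry-After: 0 falls through to normal counted backoff (function shape is illustrative):

```python
MAX_RETRY_AFTER = 300.0  # seconds, cap

def parse_retry_after(header_value):
    # Proposed fix: treat a zero (or negative) delay the same as a
    # missing header, so the caller never sets rate_limited_until
    # to an instant that is already in the past.
    if header_value is None:
        return None
    try:
        delay = float(header_value)
    except ValueError:
        return None
    if delay <= 0:
        return None
    return min(delay, MAX_RETRY_AFTER)
```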

5. Queue re-queue on full silently drops items with no on_error call

for item in batch:
    try:
        self.queue.put(item, block=False)
    except Exception:
        pass  # Queue full, item lost

When queue.put raises because the queue is full, the item is silently discarded — no log, no on_error callback. Users will lose events without any indication. At minimum, add a log at error level and call self.on_error if set:

except Exception:
    self.log.error('Queue full after 429 re-queue; item dropped: %s', item)
    if self.on_error:
        self.on_error(Exception('Queue full, item dropped after 429 re-queue'), [item])

6. flush() can block for up to max_rate_limit_duration (12 hours default)

When a 429+Retry-After is received, the batch is re-queued. queue.join() (called by flush()) waits until every put has a corresponding task_done(). If the server keeps rate-limiting, flush() will block for up to max_rate_limit_duration (12 hours by default). This is intentional behavior but has major operational consequences that should be documented in the flush() docstring and in the README/CHANGELOG.

7. is_retryable_status returns True for 410 Gone

HTTP 410 means the resource has been permanently removed. Unlike 408 (timeout) or 429 (rate limit), retrying a 410 will never succeed with the same payload. Including it as retryable is semantically incorrect — it will exhaust the retry budget on a request that cannot succeed. Please confirm whether this is intentional (e.g., Segment's API uses 410 for a transient/recoverable purpose) and add a comment if so.


Minor / Nice-to-Have

8. FatalError catch is non-obvious without reading oauth_manager.py

consumer.request() has an explicit except FatalError branch, but FatalError is never raised within post() itself — it surfaces only from oauth_manager.get_token() via the OauthManager._poller_loop thread. This is correct but hard to follow. A comment would help:

except FatalError as e:
    # Raised by OauthManager.get_token() after exhausting OAuth token retries
    self.log.error(f"Fatal error after {total_attempts} attempts: {e}")
    raise

9. Retry-After HTTP-date format silently falls back to counted backoff

parse_retry_after returns None for an HTTP-date format Retry-After header (RFC 7231 allows both integer seconds and HTTP-date). When this happens, a 429 with an HTTP-date Retry-After is treated as "no Retry-After" and falls through to the counted backoff path — consuming the retry budget — rather than the pipeline-blocking path. At minimum, log a warning:

except ValueError:
    log.warning('Unparseable Retry-After header value: %r (HTTP-date format not supported)', retry_after)
    return None
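If HTTP-date support were ever wanted, the standard library can do the parsing; a hedged sketch, not part of this PR (function name hypothetical):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_after_to_seconds(value):
    # Accept either delta-seconds or an RFC 7231 HTTP-date.
    # Returns seconds to wait (>= 0), or None if unparseable.
    try:
        return max(float(value), 0.0)
    except (TypeError, ValueError):
        pass
    try:
        target = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # TypeError on Python < 3.10, ValueError on 3.10+
        return None
    if target.tzinfo is None:
        target = target.replace(tzinfo=timezone.utc)
    return max((target - datetime.now(timezone.utc)).total_seconds(), 0.0)
```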

10. Fragile task_done() separation between early-return path and finally

The rate-limit-exceeded early-return (lines 99-119) manually calls task_done() for each item before returning, relying on the fact that the try/finally at line 130 is never entered. This creates a maintenance trap — if a future refactor moves the rate-limit check inside the try block, task_done() would be called twice per item, causing queue.join() to return prematurely. Add a comment or refactor to use a single code path.

11. Thread name not set on Consumer threads

Consumer threads are daemon threads but have no name attribute set, making them hard to identify in stack traces and debuggers. Consider adding self.name = f'segment-consumer-{write_key[:8]}' in __init__.

12. Duplicate backoff code in request() for APIError and generic Exception

The except APIError and except Exception branches in request() contain identical backoff logic (check first_failure_time, check max_total_backoff_duration, increment backoff_attempts, call calculate_backoff_delay, time.sleep). Extract into a helper to avoid drift:

def _handle_retryable_failure(e):
    nonlocal first_failure_time, backoff_attempts
    if first_failure_time is None:
        first_failure_time = time.time()
    if time.time() - first_failure_time > self.max_total_backoff_duration:
        raise
    backoff_attempts += 1
    if backoff_attempts >= self.retries + 1:
        raise
    delay = calculate_backoff_delay(backoff_attempts)
    self.log.debug(...)
    time.sleep(delay)

Test Coverage Gaps

| Gap | Severity |
| --- | --- |
| max_total_backoff_duration=None — triggers TypeError in consumer | Critical (matches issue #2 above) |
| max_rate_limit_duration=None — triggers TypeError in consumer | Critical (matches issue #2 above) |
| Queue full during 429 re-queue — items silently dropped | Important (matches issue #5) |
| Retry-After: 0 tight-loop behavior | Important (matches issue #4) |
| Retry-After with HTTP-date format falls back to counted backoff | Nice-to-have |
| Multi-consumer (thread=2) rate-limit state is not shared | Nice-to-have |
| sync_mode has no retry logic — failures propagate directly | Documentation gap |
| retries=0 with a retryable (non-429) error | Minor |
The happy-path and primary retry scenarios are very well covered. The new test suite is a significant improvement.


Positive Notes

  • The manual retry loop replacing backoff.expo gives much better control over retry semantics. The dual-budget model (per-attempt count + wall-clock duration) is solid.
  • Separating 429+Retry-After (pipeline blocking) from 429-without-Retry-After (counted backoff) is the right design.
  • X-Retry-Count header omitted on first attempt is correct — servers can use presence of the header to detect retries.
  • parse_retry_after capping at 300s is a good safety valve.
  • The on_error callback hookup in e2e-cli is a real improvement for observability in e2e tests.
  • The Basic auth migration from HTTPBasicAuth to explicit base64 is fine (equivalent behavior, removes one import).
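For reference, the explicit base64 form is equivalent to HTTPBasicAuth(write_key, ''); a sketch (the empty-password convention is an assumption based on the usual Basic auth scheme for write keys):

```python
import base64

def basic_auth_header(write_key):
    # base64("<write_key>:") with an empty password — the same bytes
    # requests' HTTPBasicAuth(write_key, '') would send.
    token = base64.b64encode(f'{write_key}:'.encode('utf-8')).decode('ascii')
    return {'Authorization': f'Basic {token}'}
```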

